This report describes the research supported by the Air Force Office of Scientific
Research grant AFOSR-78-3581, during the period March 1, 1978 to December 31,
1982. The main thrust of the research was the design of “PASM.”

PASM is a large-scale multimicroprocessor system being designed at Purdue
University for image processing and pattern recognition. This system can be
dynamically reconfigured to operate as one or more independent SIMD and/or MIMD
machines. PASM consists of a parallel computation unit, which contains N processors,
N memories, and an interconnection network; Q micro controllers, each of which
controls N/Q processors; N/Q parallel secondary storage devices; a distributed memory
management system; and a system control unit, to coordinate the other system
components. Possible values for N and Q are 1024 and 16, respectively.

This  report  consists  of  two  parts.  The  first  is  an  overview  of  the  PASM  system.  It 
is  a  preprint  of  a  paper  to  appear  as  a  chapter  in  Computer  Architectures  for  Spatially 
Distributed  Data,  H.  Freeman  and  G.  G.  Pieroni,  editors,  Springer-Verlag,  New  York, 
NY,  1983.  In  this  paper  the  interconnection  network,  control  schemes,  and  memory 
management  in  PASM  are  described.  Examples  of  how  PASM  can  be  used  to  perform 
image  processing  tasks  are  also  given.  The  second  part  of  the  final  report  is  a  list  of 
the  53  publications  that  describe  in  detail  the  research  that  was  supported  by  this 
grant. 
One way to do image processing faster is through the use of
parallelism. Different modes of parallelism can be employed in a computer
system. The SIMD (single instruction stream - multiple data stream)
mode [9] typically uses a set of N processors, N memories, an
interconnection network, and a control unit (e.g., Illiac IV [6], STARAN
[5], CLIP4 [8], MPP [16]). The control unit broadcasts instructions
to the processors and all active ("enabled") processors execute the
same instruction at the same time. Each processor executes
instructions using data taken from a memory with which only it is associated.

The  interconnection  network  allows  interprocessor  communication.  An 
MSIMD  (multiple-SIMD)  system  is  a  parallel  processing  system  which  can 
be  structured  as  one  or  more  independent  SIMD  machines  (e.g.,  MAP 
[13]).  The  Illiac  IV  was  originally  designed  as  an  MSIMD  system  [3]. 

The MIMD (multiple instruction stream - multiple data stream) mode [9]
typically consists of N processors and N memories, where each
processor can follow an independent instruction stream (e.g., C.mmp [27],
Cm* [25]). As with SIMD architectures, there is a multiple data
stream and an interconnection network. A partitionable SIMD/MIMD
system is a parallel processing system which can be structured as one
or more independent SIMD and/or MIMD machines (e.g., TRAC [17]).
This paper gives an overview of the organization of PASM [20], a
partitionable SIMD/MIMD system being designed at Purdue University.
Example parallel image processing algorithms for use on PASM are
given.

PASM is to be a large-scale dynamically reconfigurable
multimicroprocessor system. It is a special-purpose system aimed at
exploiting the parallelism of image processing and pattern recognition tasks.
PASM can be partitioned so that it operates as many independent SIMD
and/or MIMD machines of various sizes, and it is being developed using
a variety of problems in image processing and pattern recognition to
guide the design choices. It can also be applied to related areas
such as speech processing and biomedical signal processing.

PASM is to serve as a research tool for experimenting with parallel
processing. The design attempts to incorporate the needed flexibility
for studying large-scale SIMD and MIMD parallelism, while keeping
system costs "reasonable." Portions of PASM have been simulated and a
prototype is planned for the near future.

In  section  2,  the  PASM  organization  is  overviewed.  Section  3 
describes  the  Parallel  Computation  Unit.  The  Micro  Controllers  are 
discussed  in  section  4.  In  section  5,  the  secondary  memory  system  is 
explored.  Parallel  algorithms  for  computing  global  histograms  and  2-D 
FFTs  are  given  in  sections  6  and  7,  respectively. 


2. PASM Organization


A block diagram of the basic components of PASM is shown in Fig. 1.
The Parallel Computation Unit (PCU) contains N = 2^n processors, N memory
modules, and an interconnection network. The PCU processors are
microprocessors that perform the actual SIMD and MIMD computations. The
PCU memory modules are used by the PCU processors for data storage in
SIMD mode and both data and instruction storage in MIMD mode. Thus,
each PCU processor can operate in both the SIMD and MIMD modes of
parallelism. The interconnection network provides a means of
communication among the PCU processors and memory modules.

The Micro Controllers (MCs) are a set of microprocessors which act
as the control units for the PCU processors in SIMD mode and
orchestrate the activities of the PCU processors in MIMD mode. There are
Q = 2^q MCs. Each MC controls N/Q PCU processors. A virtual SIMD
machine (partition) of size RN/Q, where R = 2^r and 0 ≤ r ≤ q, is obtained by
loading R MC memory modules with the same instructions simultaneously.
Similarly, a virtual MIMD machine of size RN/Q is obtained by
combining the efforts of the PCU processors of R MCs. Q is therefore the
maximum number of partitions allowable, and N/Q is the size of the
smallest partition. Possible values for N and Q are 1024 and 32,
respectively. Control Storage contains the programs for the MCs.

The Memory Storage System provides secondary storage space for the
data files in SIMD mode, and for the data and program files in MIMD
mode. Multiple storage devices are used in the Memory Storage System
to allow parallel data transfers. The Memory Management System
controls the transferring of files between the Memory Storage System and
the PCU memory modules. It employs a set of cooperating dedicated
microprocessors.

The System Control Unit is a conventional machine, such as a
PDP-11, and is responsible for the overall coordination of the
activities of the other components of PASM. The types of tasks the System
Control Unit will perform include program development, job scheduling,
and coordination of the loading of the PCU memory modules from the
Memory Storage System with the loading of the MC memory modules from
Control Storage. By carefully choosing which tasks should be assigned
to the System Control Unit and which should be assigned to other
system components, the System Control Unit can work effectively and not
become a bottleneck.

Sections 3 through 5 provide more information about the PASM
system. References for further reading about PASM appear at the end of
this paper.


The Parallel Computation Unit (PCU) is shown in Fig. 2. A memory
module is connected to each processor to form a processor-memory
pair called a Processing Element (PE). The N PEs are numbered from 0
to N-1 and each PE knows its number (address). The interconnection
network is used for communications among PEs. A pair of memory units
is used for each memory module. This double-buffering scheme allows
data to be moved between one memory unit and secondary storage (the
Memory Storage System) while the processor operates on data in the
other memory unit.

The PCU processors will be specially designed for parallel image
processing. A PASM prototype (for N=16, Q=4) has been designed based
on Motorola MC68000 processors. The final (N>1024) system would most
likely employ custom VLSI processors.

Two types of multistage interconnection networks are being
considered for PASM: the Generalized Cube [19] and the Augmented Data
Manipulator (ADM) [18]. Features of the Generalized Cube network will
be described to familiarize the readers with the properties of
multistage networks.

Fig. 2. Parallel Computation Unit (PCU).


The Generalized Cube network is a multistage cube-type network
topology which was introduced as a standard for comparing network
topologies. Other multistage cube-type networks include the baseline
[26], delta [14], Extra Stage Cube [1], indirect binary n-cube [15],
omega [12], STARAN flip [4], and SW-banyan (S=F=2) [10]. The Cube has
N inputs and N outputs. It is shown in Fig. 3 for N=8. PE i, 0 ≤ i < N,
would be connected to input port i and output port i of the
unidirectional network.

Fig. 3. Generalized Cube topology, shown for N=8.

The Generalized Cube topology has n = log2 N stages, where each
stage consists of a set of N lines connected to N/2 interchange boxes.
Each interchange box is a two-input, two-output device. The labels of
the input/output (I/O) lines entering the upper and lower inputs of an
interchange box are used as the labels for the upper and lower
outputs, respectively. Each interchange box can be set to one of the
four legitimate states shown in Fig. 3.

The connections in this network are based on the cube
interconnection functions [21, 22]. Let $P = p_{n-1} \cdots p_1 p_0$ be the binary
representation of an arbitrary I/O line label. Then the n cube
interconnection functions can be defined as:

$$\mathrm{cube}_i(p_{n-1} \cdots p_1 p_0) = p_{n-1} \cdots p_{i+1}\,\bar{p}_i\,p_{i-1} \cdots p_1 p_0$$

where $0 \le i < n$, $0 \le P < N$, and $\bar{p}_i$ denotes the complement of $p_i$. This means
that the cube_i interconnection function connects P to cube_i(P), where
cube_i(P) is the I/O line whose label differs from P in just the i-th
bit position. Stage i of the Generalized Cube topology contains the
cube_i interconnection function, i.e., it pairs I/O lines that differ
only in the i-th bit position.

Fig. 4. Three-dimensional cube structure, with vertices labeled
from 0 to 7 in binary.

The reason that these interconnections are referred to as cube
connections can be seen by considering the case for N=8. This is shown
in Fig. 4. The eight vertices can be labeled so each vertex is
connected to the n=3 vertices that differ from it in just one bit
position. The horizontal connections are cube_0, the diagonals are cube_1,
and the verticals are cube_2.
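
The cube interconnection functions reduce to complementing a single bit of the line label. The following is a minimal Python sketch of that definition; the function names `cube` and `neighbors` are illustrative choices made here and do not come from the PASM design.

```python
def cube(i: int, label: int, n: int) -> int:
    """Return cube_i(label): the n-bit label with bit i complemented."""
    if not 0 <= i < n:
        raise ValueError("i must satisfy 0 <= i < n")
    if not 0 <= label < (1 << n):
        raise ValueError("label must satisfy 0 <= label < N = 2**n")
    return label ^ (1 << i)


def neighbors(label: int, n: int) -> list[int]:
    """Labels paired with `label` by cube_0, cube_1, ..., cube_{n-1}."""
    return [cube(i, label, n) for i in range(n)]


# N = 8 (n = 3): every vertex of the three-dimensional cube has 3 neighbors.
n = 3
for vertex in range(1 << n):
    linked = ", ".join(format(v, "03b") for v in neighbors(vertex, n))
    print(f"{vertex:03b} -> {linked}")
```

Applying the same cube function twice returns the original label, which is why each interchange box pairs exactly two lines.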

Using routing tags (as headers on messages) allows network control
to be distributed among the PEs. The routing tags for one-to-one data
transfers consist of n bits. If certain broadcast capabilities are
included, then 2n bits are used. The routing tags set the state of
each interchange box individually.

The n-bit routing tag for one-to-one connections is computed from
the input port number and desired output port number. Let S be the
source address (input port number) and D be the destination address
(output port number). Then the routing tag T = S ⊕ D (where "⊕" means
bitwise "exclusive-or"). Let t_{n-1}...t_1 t_0 be the binary representation
of T. An interchange box in the network at stage i need only examine
t_i. If t_i = 1, an exchange is performed, and if t_i = 0, the straight
connection is used. For example, if N=8, S=011, and D=110, then T=101.
The corresponding stage settings are exchange, straight, exchange.
Because the exclusive-or operation is commutative, the incoming
routing tag is the same as the return tag. Since the destination PE has
the routing tag to the source PE, it is easy to perform handshaking if
desired. The address of the source PE can be computed by the
destination PE using S = D ⊕ T.
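
The routing-tag rules can be checked with a short Python sketch. The function names are hypothetical, and the traversal order (stage n-1 first, stage 0 last) is an assumption made for the listing; the final output port does not depend on that order.

```python
def routing_tag(source: int, dest: int) -> int:
    """T = S xor D."""
    return source ^ dest


def stage_settings(tag: int, n: int) -> list[str]:
    """Box setting seen at each stage, listed from stage n-1 down to stage 0."""
    return ["exchange" if (tag >> i) & 1 else "straight"
            for i in range(n - 1, -1, -1)]


def route(source: int, tag: int, n: int) -> int:
    """Follow a message through all n stages and return the output port."""
    line = source
    for i in range(n - 1, -1, -1):
        if (tag >> i) & 1:      # t_i = 1: exchange, i.e. apply cube_i
            line ^= 1 << i
    return line


S, D, n = 0b011, 0b110, 3
T = routing_tag(S, D)
print(format(T, "03b"), stage_settings(T, n))
print(format(route(S, T, n), "03b"))   # port reached from S
print(format(D ^ T, "03b"))            # source recovered by the destination
```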

Routing tags that can be used for broadcasting data are an
extension of the above scheme. They are described in [19].

The Cube network can be partitioned into independent subnetworks of
varying sizes. The partitionability of a network is its ability to
divide the system into independent subsystems of different sizes.
Furthermore, in this case, each subnetwork of size N' ≤ N will have all
of the connection properties of a Cube network built to be of size N'.

The key to partitioning the Cube network so that each subnetwork is
independent is based on the choice of the I/O ports that belong to the
subnetworks. The requirement is that the addresses of all of the I/O
ports in a partition of size 2^i agree (have the same values) in n-i of
their bit positions.

For example, Fig. 5 shows one way a network of size eight can be
partitioned into two subnetworks, each of size four. Group A consists
of ports 0, 2, 4, and 6. Group B consists of ports 1, 3, 5, and 7.
All ports in group A agree in the low-order bit position (it is a 0).
All ports in group B agree in the low-order bit position (it is a 1).
By setting all of the interchange boxes in stage 0 to straight, the
two groups are isolated. This is because stage 0 is the only stage
which allows input ports which differ in their low-order bit to
exchange data. As stated above, each subnetwork has the properties of a
Cube network. Thus, each subnetwork can be separately further
subdivided, resulting in subnetworks of various sizes. This network
property allows the PASM PCU PEs to be partitioned into independent
virtual machines of various sizes.


Fig.  5.  Cube  network  of  size  eight  partitioned  into  two  subnetworks 
of  size  four  based  on  the  low-order  bit  position. 

The routing tag scheme discussed previously can be used in
conjunction with the partitioning concepts. Tags can be logically AND-ed
with masks to force to 0 tag positions which correspond to interchange
boxes which should be forced to the straight state.
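
As a hedged illustration of combining partitioning with routing tags, the Python sketch below builds the AND-mask for a partition and applies it to a tag. The helper names are hypothetical; the example uses group A of Fig. 5 (ports 0, 2, 4, 6, which agree in bit position 0).

```python
def agree_in(ports: list[int], positions: list[int]) -> bool:
    """True if all port addresses have equal bits at every listed position."""
    return all(len({(p >> i) & 1 for p in ports}) == 1 for i in positions)


def partition_mask(n: int, fixed_positions: list[int]) -> int:
    """n-bit AND-mask with a 0 at each stage that must stay straight."""
    mask = (1 << n) - 1
    for i in fixed_positions:
        mask &= ~(1 << i)
    return mask


def masked_tag(source: int, dest: int, mask: int) -> int:
    """Routing tag with the partition-crossing positions forced to 0."""
    return (source ^ dest) & mask


group_a = [0, 2, 4, 6]
mask = partition_mask(3, [0])            # stage 0 forced to straight
print(agree_in(group_a, [0]), format(mask, "03b"))

# Even an out-of-partition destination (7) cannot leave group A:
tag = masked_tag(2, 7, mask)
print(format(tag, "03b"), 2 ^ tag)
```

With the mask applied, a message from port 2 addressed to port 7 is delivered to port 6, so an erroneous tag cannot disturb the other partition.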


The tradeoffs between the Cube and ADM multistage networks for PASM
are currently under study. The ADM network is more flexible, but is
more complex. The Cube may be more cost effective and sufficient for
the system's needs. The Extra Stage Cube network [1] is a
fault-tolerant variation of the Cube which is planned for inclusion in the
PASM prototype.

In the following sections, it will be assumed that the PEs will be
partitioned such that their addresses agree in the low-order bit
positions. This constraint will allow either the Cube or ADM network to
be used as the partitionable interconnection network in PASM.


4. Micro Controllers

In general, the possible advantages of a partitionable system
include:

(a)  fault  tolerance  -  If  a  single  PE  fails,  only  those  virtual 
machines  (partitions)  which  must  include  the  failed  PE  need  to  be 
disabled.  The  rest  of  the  system  can  continue  to  function. 

(b) multiple simultaneous users - Since there can be multiple
independent virtual machines, there can be multiple simultaneous users of
the system, each executing a different program.

(c) program development - Rather than trying to debug a program on,
for example, 1024 PEs, it can be debugged on a smaller size
virtual machine of 32 PEs.

(d)  variable  machine  size  for  efficiency  -  If  a  task  requires  only  N/2 
of  N  available  PEs,  the  other  N/2  can  be  used  for  another  task. 

(e)  subtask  parallelism  -  Two  independent  subtasks  that  are  part  of 
the  same  job  can  be  executed  in  parallel,  sharing  results  if 
necessary. 

Some form of multiple control units must be provided in order to
have a partitionable SIMD/MIMD system. In PASM, this is done by
having Q = 2^q MCs, physically addressed (numbered) from 0 to Q-1. Each MC
controls N/Q PCU processors, as shown in Fig. 6.

Each MC is a microprocessor attached to a memory module. A memory
module consists of a pair of memory units so that memory loading and
computations can be overlapped. In SIMD mode, each MC fetches
instructions from its memory module, executing the control flow
instructions (e.g., branches) and broadcasting the data processing
instructions to its PCU processors. The physical addresses of the N/Q
processors which are connected to an MC must all have the same low-order
q bits so that the network can be partitioned. The value of these
low-order q bits is the physical address of the MC. A virtual SIMD
machine of size RN/Q, where R = 2^r and 0 ≤ r ≤ q, is obtained by loading R
MCs with the same instructions and synchronizing the MCs. The
physical addresses of these MCs must have the same low-order q-r bits so
that all of the PCU processors in the partition have the same
low-order q-r physical address bits. Similarly, a virtual MIMD machine of
size RN/Q is obtained by combining the efforts of the PCU PEs
associated with R MCs which have the same low-order q-r physical address
bits. In MIMD mode, the MCs may be used to help coordinate the
activities of their PCU PEs.

Fig. 6. PASM Micro Controllers (MCs).
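
The addressing rules for MCs and partitions can be made concrete with a small Python sketch. The function names and the use of Python lists are illustrative assumptions; only the bit-field rules come from the text. The demonstration uses N=32 and Q=4 (n=5, q=2).

```python
def pes_of_mc(k: int, n: int, q: int) -> list[int]:
    """Physical addresses of the N/Q PEs whose low-order q bits equal k."""
    N, Q = 1 << n, 1 << q
    return [Q * i + k for i in range(N // Q)]


def mcs_in_partition(low_bits: int, r: int, q: int) -> list[int]:
    """The R = 2**r MCs whose low-order q-r address bits equal low_bits."""
    return [(high << (q - r)) | low_bits for high in range(1 << r)]


def partition_pes(low_bits: int, r: int, n: int, q: int) -> list[int]:
    """All PEs of the virtual machine of size R*N/Q, in ascending order."""
    pes = []
    for k in mcs_in_partition(low_bits, r, q):
        pes.extend(pes_of_mc(k, n, q))
    return sorted(pes)


n, q, r = 5, 2, 1                      # N = 32, Q = 4, R = 2
print(mcs_in_partition(1, r, q))       # MCs sharing low-order bit 1
print(partition_pes(1, r, n, q))       # the R*N/Q = 16 PEs they control
```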

Permanently assigning a fixed number of PCU PEs to each MC has
several advantages over allowing a varying assignment, such as that used in
MAP. One advantage is that the operating system need only schedule
(and monitor the "busy" status of) Q MCs, rather than N PCU PEs. When
Q=32 and N=1024, this is a substantial savings. Another advantage is
that no crossbar switch is needed for connecting processors and
control units (such as proposed for MAP [13]). A third advantage is that
it supports network partitioning. In addition, this fixed connection
scheme allows the efficient use of multiple secondary storage devices,
which is discussed below. The main disadvantage of this approach is
that each virtual machine size must be a power of two, with a minimum
value of N/Q. However, for PASM's intended experimental environment,
flexibility at reasonable cost is the goal, not maximum processor
utilization.

The loading of programs from Control Storage into the MC memory
units is controlled by the System Control Unit. When large SIMD jobs
are run, that is, jobs which require more than N/Q processors, more
than one MC executes the same set of instructions. Each MC has its
own memory, so that if more than one MC is to be used, several
memories must be loaded with the same set of instructions. The
fastest way to load several MC memories with the same set of
instructions is to load all of the memories at the same time. A shared bus
from Control Storage is used to do this parallel loading.

This basic MC organization can be enhanced to allow the sharing of
memory modules by the MCs in a partition. The MCs can be connected by
a shared reconfigurable ("shortable") bus [2, 11], as shown in Fig. 7.
The MCs must be ordered on the bus in terms of the bit reverse of
their addresses due to the partitioning rules. This enhanced MC
connection scheme could provide more program space for jobs using
multiple MCs and would also provide a degree of fault tolerance, since
known-faulty MC memory modules could be ignored. These advantages
come at the expense of additional system complexity, and the inclusion
of the enhanced scheme in PASM will depend on cost constraints at
implementation time.

Fig. 7. Reconfigurable shared bus scheme for interconnecting MC
processors and MC memory modules, shown for Q=8. Each box
can be set to "through" or "short."

Within each partition the PCU processors and memory modules are
assigned logical addresses. Given a virtual machine of size RN/Q, the
processors and memory modules for this partition have logical
addresses (numbers) 0 to (RN/Q)-1, R = 2^r, 0 ≤ r ≤ q. The logical number of a
PCU PE is the high-order r+n-q bits of its physical number.
Similarly, the MCs assigned to the partition are logically numbered
(addressed) from 0 to R-1. For R>1, the logical number of an MC is the
high-order r bits of its physical number. The PASM language compilers
and operating system will be used to convert from logical to physical
addresses, so a system user will deal only with logical addresses.
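
The logical-to-physical conversion amounts to dropping or restoring the low-order q-r bits shared by a partition. The Python sketch below illustrates this under that reading; the function names are hypothetical and are not part of the PASM compilers or operating system.

```python
def logical_pe(physical: int, r: int, q: int) -> int:
    """High-order r+n-q bits of an n-bit physical PE number."""
    return physical >> (q - r)


def physical_pe(logical: int, low_bits: int, r: int, q: int) -> int:
    """Inverse mapping: re-attach the q-r low-order bits of the partition."""
    return (logical << (q - r)) | low_bits


def logical_mc(physical_mc: int, r: int, q: int) -> int:
    """High-order r bits of a q-bit physical MC number."""
    return physical_mc >> (q - r)


# N = 32, Q = 4 (q = 2); partition of R = 2 MCs (r = 1): MCs 1 and 3.
q, r, low_bits = 2, 1, 1
for physical in (1, 3, 13, 31):
    logical = logical_pe(physical, r, q)
    print(physical, "->", logical, "->", physical_pe(logical, low_bits, r, q))
print(logical_mc(1, r, q), logical_mc(3, r, q))
```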

There are instructions which examine the collective status of all
of the PEs of a virtual SIMD machine, such as "if any," "if all," and
"if none." These instructions change the flow of control of the
program at execution time depending on whether any or all processors in
the virtual SIMD machine satisfy some condition. For example, if each
PE is processing data from a different section of a radar unit, but
all PEs are looking for enemy planes, it is desirable to know "if any"
of the PEs has discovered a possible attack. This requires
communication among the MCs comprising the virtual SIMD machine. There is a
set of buses shared by MCs for this purpose.

When  operating  in  SIMD  mode,  all  of  the  active  PCU  PEs  will  execute 
instructions  broadcast  to  them  by  their  MC.  A  masking  scheme  is  a 
method  for  determining  which  PCU  PEs  will  be  active  at  a  given  point 
in  time.  PASM  will  use  PE  address  masks  and  data  conditional  masks. 

The PE address masking scheme uses an n-position mask to specify
which of the N PCU PEs are to be activated. Each position of the mask
corresponds to a bit position in the addresses of the PEs. Each
position of the mask will contain either a 0, 1, or X ("don't care") and
the only PEs that will be active are those whose address matches the
mask: 0 matches 0, 1 matches 1, and either 0 or 1 matches X. Square
brackets denote a mask. Superscripts are used as repetition factors.
For example: MASK [X^{n-1}1] activates all odd-numbered PEs; MASK
[1^{n-i}X^i] activates PEs N-2^i to N-1. PE address masks are specified in
the SIMD program.

A negative PE address mask is similar to a regular PE address mask,
except that it activates all those PEs which do not match the mask.
Negative PE address masks are prefixed with a minus sign to
distinguish them from regular PE address masks. For example, for N=8, MASK
[-01X] activates all PEs except 2 and 3. This type of mask can
activate sets of PEs a single regular PE address mask cannot.
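
A short Python sketch can show which PEs a regular or negative PE address mask activates. The string encoding of masks (for example "XX1" or "-01X", with repetition factors already expanded) and the function name `active_pes` are assumptions made for this illustration.

```python
def active_pes(mask: str, n: int) -> list[int]:
    """PEs enabled by an n-position mask of 0, 1, X; a leading '-' negates it."""
    negative = mask.startswith("-")
    pattern = mask[1:] if negative else mask
    if len(pattern) != n or set(pattern) - set("01X"):
        raise ValueError("mask must have n positions drawn from 0, 1, X")

    def matches(address: int) -> bool:
        bits = format(address, f"0{n}b")          # high-order bit first
        return all(m in ("X", b) for m, b in zip(pattern, bits))

    return [a for a in range(1 << n) if matches(a) != negative]


print(active_pes("XX1", 3))    # [X^{n-1}1] for n = 3
print(active_pes("1XX", 3))    # [1^{n-i}X^i] for n = 3, i = 2
print(active_pes("-01X", 3))   # negative mask
```

The negative mask example enables six PEs, a set size that no single regular mask can produce, since a regular mask always enables a power-of-two number of PEs.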

Data conditional masks will be implemented in PASM for use when the
decision to enable and disable PEs is made at execution time. Data
conditional masks are the implicit result of performing a conditional
branch dependent on local data in an SIMD machine environment, where
the result of different PEs' evaluations may differ. As a result of a
conditional where statement of the form

where <data-condition> do ... elsewhere ...

each PE will set its own flag to activate itself for either the "do"
or the "elsewhere," but not both. The execution of the "elsewhere"
statements must follow the "do" statements; i.e., the "do" and
"elsewhere" statements cannot be executed simultaneously. For example, as
a result of executing the statement:

where A < B do C ← A elsewhere C ← B

each PE will load its C register with the minimum of its A and B
registers, i.e., some PEs will execute "C ← A," and then the rest will
execute "C ← B." This type of masking is used in such machines as the
Illiac IV [3] and PEPE [7]. "Where" statements can be nested using a
run-time control stack.
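
The two-phase behavior of a where statement can be simulated sequentially. In the Python sketch below, each PE is modeled as a dictionary of registers and the broadcast of the "do" and "elsewhere" statements is modeled as two loops; this modeling and all names are assumptions for illustration, not the PASM instruction set.

```python
def where_elsewhere(pes, condition, do, elsewhere):
    """Simulate 'where <condition> do ... elsewhere ...' on a list of PEs."""
    # Each PE evaluates the condition on its own data and sets its flag.
    flags = [condition(pe) for pe in pes]
    # Phase 1: the "do" statement is broadcast; only flagged PEs execute it.
    for pe, flag in zip(pes, flags):
        if flag:
            do(pe)
    # Phase 2: the "elsewhere" statement is broadcast to the remaining PEs.
    for pe, flag in zip(pes, flags):
        if not flag:
            elsewhere(pe)


def load_c_from_a(pe):
    pe["C"] = pe["A"]


def load_c_from_b(pe):
    pe["C"] = pe["B"]


pes = [{"A": 3, "B": 5, "C": None},
       {"A": 9, "B": 2, "C": None},
       {"A": 4, "B": 4, "C": None}]
where_elsewhere(pes, lambda pe: pe["A"] < pe["B"], load_c_from_a, load_c_from_b)
print([pe["C"] for pe in pes])   # minimum of A and B in every PE
```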


5.  Secondary  Memory  System 


The Memory Storage System will consist of N/Q independent Memory
Storage Units, numbered from 0 to (N/Q)-1. These devices will allow
fast loading and unloading of the N double-buffered PCU memory modules
and will provide storage for system image data and MIMD programs.

Each Memory Storage Unit is connected to Q PCU memory modules. For
0 ≤ i < N/Q, Memory Storage Unit i is connected to those memory
modules whose physical addresses are of the form (Q*i)+k, 0 ≤ k < Q.
Recall that, for 0 ≤ k < Q, MC k is connected to those PEs whose physical
addresses are of the form (Q*i)+k, 0 ≤ i < N/Q. This is shown for
N=32 and Q=4 in Fig. 8.

For a partition of size N/Q, the two main advantages of this
approach are that (1) all of the memory modules can be loaded in
parallel and (2) the data is directly available no matter which partition
(MC group) is chosen. This is done by storing in Memory Storage Unit
i the data for a task which is to be loaded into the i-th logical
memory module of the virtual machine of size N/Q, 0 ≤ i < N/Q. Memory
Storage Unit i is connected to the i-th memory module in each MC group
so that no matter which MC group of N/Q processors is chosen, the data
from the i-th Memory Storage Unit can be loaded into the i-th logical
memory module, 0 ≤ i < N/Q, simultaneously. Thus, for virtual
machines of size N/Q, this secondary storage scheme allows all N/Q
memory modules to be loaded in one parallel block transfer.

Fig. 8. Organization of the Memory Storage System, shown for N=32
and Q=4. "MSU" is Memory Storage Unit.

A virtual machine of RN/Q PEs, 1 ≤ R ≤ Q, logically numbered from 0 to
RN/Q-1, requires only R parallel block loads if the data for the
memory module whose high-order n-q logical address bits equal i is
loaded into Memory Storage Unit i. This is true no matter which group
of R MCs (which agree in their low-order q-r address bits) is chosen.

As an example, consider Fig. 8, and assume a virtual machine of
size 16 is desired. The data for the memory modules whose logical
addresses are 0 and 1 is loaded into Memory Storage Unit 0, for memory
modules 2 and 3 into unit 1, etc. Assume the partition of size 16 is
chosen to consist of the processors connected to MCs 1 and 3. Given
this assignment of MCs, the PCU memory module whose physical address
is 2*i+1 has logical address i, 0 ≤ i < 16. The Memory Storage Units
first load memory modules physically addressed 1, 5, 9, 13, 17, 21,
25, and 29 (simultaneously), and then load memory modules 3, 7, 11,
15, 19, 23, 27, and 31 (simultaneously). No matter which pair of MCs
is chosen, only two parallel block loads are needed. Thus, for a
virtual machine of size RN/Q, this secondary storage scheme allows all
RN/Q memory modules to be loaded in R parallel block transfers, 1 ≤ R ≤ Q.

This same approach can be taken if only (N/Q)/2^d distinct Memory
Storage Units are available, where 0 ≤ d ≤ n-q. In this case,
however, R*2^d parallel block loads will be required instead of just R. The
number and types of devices that will be used in PASM will depend upon
speed requirements, cost constraints, and the state of the art of
storage technology at implementation time.

The Memory Management System is composed of a separate set of
microprocessors dedicated to performing tasks in a distributed fashion,
i.e., one processor handles Memory Storage System bus control, one
handles the peripheral device I/O, etc. This distributed processing
approach is chosen in order to provide the Memory Management System
with a large amount of processing power at low cost. The division of
tasks chosen is based on the main functions which the Memory
Management System must perform, including: (1) generating tasks based on PCU
memory module load/unload requests from the System Control Unit; (2)
scheduling of Memory Storage System data transfers; (3) control of
input/output operations involving peripheral devices and the Memory
Storage System; (4) maintenance of the Memory Management System file
directory information; and (5) control of the Memory Storage System
bus system.
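
The parallel block-load scheme of this section can be traced with a small Python sketch. It assumes one block load per MC in the partition, with every Memory Storage Unit serving one memory module in each load; the function names and the dictionary layout are illustrative choices.

```python
def msu_of_module(physical: int, q: int) -> int:
    """Memory Storage Unit i serves modules Q*i + k, so i = physical >> q."""
    return physical >> q


def block_load_schedule(mcs: list[int], n: int, q: int) -> list[dict[int, int]]:
    """One parallel block load per MC; each step maps MSU -> module loaded."""
    N, Q = 1 << n, 1 << q
    return [{i: Q * i + k for i in range(N // Q)} for k in sorted(mcs)]


# N = 32, Q = 4, virtual machine of size 16 built from MCs 1 and 3.
n, q, r = 5, 2, 1
for step, loads in enumerate(block_load_schedule([1, 3], n, q), start=1):
    print("load", step, "modules", list(loads.values()))

# The MSU index equals the high-order n-q bits of the logical address.
logical = 13 >> (q - r)                  # physical module 13 -> logical 6
print(msu_of_module(13, q), logical >> r)
```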


6. Parallel Computation of a Global Histogram


In this section, an SIMD algorithm for computing the global
histogram of an image is given [20]. Assume there are B = 2^b bins in the
histogram, B ≤ N. An M by M image is represented by an array of M^2
pixels (picture elements), where the value of each pixel is assumed to
be a b-bit unsigned integer representing one of B possible gray
levels. The B-bin histogram of the image contains a j in bin i if
exactly j of the pixels have a gray level of i, 0 ≤ i < B.

Assume the image is equally distributed among the N PEs in PASM,
i.e., each PE has $M^2/N$ pixels, and $B < M^2/N$. Since the image is
distributed over N PEs, each PE will calculate a B-bin histogram based on
its $M^2/N$ segment of the image. Then these "local" histograms will be
combined using the algorithm described below. This algorithm is
demonstrated for N = 16 and B = 4 bins in Fig. 9.

Fig. 9. Histogram calculation for N=16 PEs, B=4 bins. (w,...,z)
denotes that bins w through z of the partial histogram are in the PE.

Each block of B PEs performs B simultaneous recursive doublings
[24] to compute the histogram for the portion of the image contained
in the block in the first b steps. At the end of the b steps, each PE
has one bin of this partial histogram. This is accomplished by first
dividing the B PEs of a block into two groups. Each group accumulates
the sums for half of the bins, and sends the bins it is not accumulating
to the group which is accumulating those bins. At each step of
the algorithm, each group of PEs is divided in half such that the PEs
with the lower addresses form one group, and the PEs with the higher
addresses form another. The accumulated sums are similarly divided in
half based on their indices in the histogram. The groups then exchange
sums, so that each PE contains only sum terms which it is accumulating.
The newly-received sums are added to the sums already in
the PE. After b steps, each PE has the total value for one bin from
the portion of the image contained in the B PEs in its block.

The results for these blocks can be combined in n-b steps, where
$n = \log_2 N$, to yield the histogram of the entire image distributed
over B PEs, with the sum for bin i in PE i, $0 \le i < B$. This is done
by performing n-b steps of a recursive doubling [24] algorithm to sum
the partial histograms from the N/B blocks, shown by the last two steps
of Fig. 9. Note that B recursive doublings are being performed
simultaneously, one for each bin. A general algorithm to compute the
B-bin histogram for an image distributed over N PEs is given in [20].
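
The merging procedure described above can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the algorithm of [20]: the function name merge_histograms, the pairing of PE p with PE p XOR (half the group size), and the direction of the final block-combining transfers (toward the lowest-numbered block) are all choices made here for illustration. Each simulated PE starts with its local B-bin histogram, halves the set of bins it accumulates in each of the first b steps, and the n-b remaining steps sum the per-block results.

```python
def merge_histograms(local_hists, B):
    """Combine N local B-bin histograms; entry i of the result is the
    total for bin i, i.e. the value left in (simulated) PE i."""
    N = len(local_hists)
    b, n = B.bit_length() - 1, N.bit_length() - 1
    assert (1 << b) == B and (1 << n) == N and B <= N

    # sums[p] holds the bins PE p is currently accumulating.
    sums = [list(h) for h in local_hists]

    # First b steps: inside each block of B PEs, halve the bins held per PE.
    for s in range(b):
        half = B >> (s + 1)            # half the group size = bins kept per PE
        nxt = []
        for p in range(N):
            partner = sums[p ^ half]   # PE in the other half of p's group (assumed pairing)
            if p & half:               # higher addresses keep the upper bins
                kept, recv = sums[p][half:], partner[half:]
            else:                      # lower addresses keep the lower bins
                kept, recv = sums[p][:half], partner[:half]
            nxt.append([x + y for x, y in zip(kept, recv)])
        sums = nxt                     # all PEs update together (one SIMD step)

    # Each PE p now holds a single value: bin (p mod B) for its block.
    totals = [s[0] for s in sums]

    # Last n-b steps: recursive doubling across the N/B blocks.
    for t in range(n - b):
        stride = B << t
        for p in range(N):
            if p % (2 * stride) < B:   # receiving blocks at this step
                totals[p] += totals[p + stride]
    return totals[:B]


# Hypothetical test data: N = 16 PEs, B = 4 gray levels, 8 pixels per PE.
N, B = 16, 4
pixels = [[(7 * p + 3 * k) % B for k in range(8)] for p in range(N)]
local = [[seg.count(g) for g in range(B)] for seg in pixels]
print(merge_histograms(local, B))
```

The result can be checked against a histogram computed directly over all pixels; the two should agree bin by bin.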

Now consider relative speeds of sequential and parallel computation
of the histogram. A sequential algorithm to compute the histogram of
an M by M image requires $M^2$ additions. The SIMD algorithm uses $M^2/N$
additions for each PE to compute its local histogram. At step i in
the merging of the partial histograms, $0 \le i < b$, the number of
parallel data transfer/adds required is $B/2^{i+1}$. A total of B-1
transfer/adds are therefore performed in the first b steps of the
algorithm. Then n-b parallel transfers and additions are needed to
combine the block histograms. This technique therefore requires B-1+n-b
parallel transfer/add operations, plus the $M^2/N$ additions needed to
compute the local PE histograms. For example, if N = 1024, M = 512, and
B = 128, the sequential algorithm would require 262,144 additions; the
parallel algorithm uses 256 addition steps plus 130 transfer/add steps.
The result of the algorithm, i.e., the histogram, is distributed over the
first B PEs. This distribution may be efficient for further processing
on the histogram, e.g., finding the maximum or minimum, or for
smoothing the histogram. If it is necessary for the entire histogram
to be in a single PE, B-1 additional parallel data transfers are
required. Both the Cube and ADM multistage networks can perform all of
the required inter-PE data transfers efficiently.
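
The operation counts in the example can be reproduced with a few lines of arithmetic. The following is a minimal sketch; the function name histogram_costs is hypothetical, and N and B are assumed to be powers of two with n = log2 N and b = log2 B.

```python
def histogram_costs(N, M, B):
    """Return (sequential additions, per-PE local additions,
    parallel transfer/add steps) for an M by M image."""
    n = N.bit_length() - 1                 # n = log2 N
    b = B.bit_length() - 1                 # b = log2 B
    sequential_adds = M * M
    local_adds = M * M // N
    # Step i of the first b steps needs B / 2^(i+1) transfer/adds.
    block_steps = sum(B >> (i + 1) for i in range(b))   # equals B - 1
    merge_steps = block_steps + (n - b)
    return sequential_adds, local_adds, merge_steps


print(histogram_costs(1024, 512, 128))   # (262144, 256, 130)
```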


7.  2-D  FFT  Algorithms 


In this section, an SIMD algorithm to compute the 2-D FFT of an image
is given [23]. A standard approach to computing the 2-D DFT of an
image S is to perform the 1-D DFT on the rows of S, giving an
intermediate matrix G, and then perform the 1-D DFT on the columns of G.
The resulting matrix F is the 2-D DFT of S. Suppose that an SIMD
machine has N = M PEs, each of which has one row of an M by M input
image S. An efficient method for obtaining F, the DFT of S, is to
perform M 1-D FFTs in parallel on the rows of S to get G, "transpose" G,
and then perform M 1-D FFTs in parallel on the columns of G to get $F^T$.
This is shown in Fig. 10. ($F^T$ can be transposed to give F; however,
this may not be necessary depending on what further processing is done
on F.)
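
The row-transform, transpose, column-transform sequence can be checked numerically on a small array. The sketch below is illustrative only: it uses a direct 1-D DFT as a stand-in for the serial FFT run in each PE, assumes the common exp(-2*pi*i*u*k/M) sign convention, and all function names are hypothetical. It confirms that the procedure yields the transpose of the 2-D DFT.

```python
import cmath


def dft(x):
    """Direct 1-D DFT (stand-in for the serial FFT executed in one PE)."""
    M = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * u * k / M) for k in range(M))
            for u in range(M)]


def transpose(A):
    return [list(col) for col in zip(*A)]


def dft2_row_column(S):
    G = [dft(row) for row in S]      # PE v transforms row v of S
    GT = transpose(G)                # row w of GT is column w of G
    FT = [dft(row) for row in GT]    # PE w transforms column w of G
    return FT                        # this is the transpose of F


def dft2_direct(S):
    """2-D DFT straight from the definition, for comparison."""
    M = len(S)
    return [[sum(S[x][y] * cmath.exp(-2j * cmath.pi * (u * x + v * y) / M)
                 for x in range(M) for y in range(M))
             for v in range(M)] for u in range(M)]


S = [[1, 2, 0, 1], [0, 1, 3, 2], [2, 0, 1, 1], [1, 1, 0, 4]]
F = dft2_direct(S)
FT = dft2_row_column(S)
print(all(abs(FT[v][u] - F[u][v]) < 1e-9 for u in range(4) for v in range(4)))
```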

Fig. 10. Computation of 2-D FFT of M by M array S using M PEs.

To form the transpose of G, $G^T$, such that each row of $G^T$ is in a
different PE, the basic operation performed is the transfer of array
element G(v,w) from PE v to PE w. This is done for M G(v,w)'s in
parallel by sending data from PE v to PE (v+i) mod M for all of the
G(v,w) for which (w-v) mod M = i. The parallel transfer operation is
performed for $1 \le i < M$. For each i value, the element which PE v
sends is the w-th element of the row of G held in PE v, where w = (v+i)
mod M. That element, received in PE w, is stored as the v-th element of
the column of G being created in PE w, where v = (w-i) mod M. The
elements on the diagonal G(v,w), where v = w, do not have to be
transferred. Performing the transpose therefore requires M-1 parallel
data transfers.
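
The transfer pattern can be simulated sequentially to confirm that M-1 parallel transfers produce the transpose. This is a sketch under simple assumptions: the name simd_transpose is hypothetical, G[v] models the row of G held in PE v, and GT[w] models the column of G being assembled in PE w.

```python
def simd_transpose(G):
    """Simulate the transpose; returns (GT, number of parallel transfers)."""
    M = len(G)
    GT = [[None] * M for _ in range(M)]
    for v in range(M):
        GT[v][v] = G[v][v]           # diagonal elements are already in place
    transfers = 0
    for i in range(1, M):            # one parallel transfer per i, 1 <= i < M
        for v in range(M):           # all PEs act simultaneously in SIMD mode
            w = (v + i) % M          # PE v sends G(v,w) to PE w
            GT[w][v] = G[v][w]       # stored as the v-th element in PE w
        transfers += 1
    return GT, transfers


G = [[10 * v + w for w in range(4)] for v in range(4)]
GT, transfers = simd_transpose(G)
print(GT == [list(col) for col in zip(*G)], transfers)
```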

The serial complexity of 2M 1-D FFTs (i.e., an M by M 2-D DFT) is
$M^2 \log_2 M$ "butterflies." The above parallel implementation of the
2-D DFT executes two serial FFT algorithms and has a complexity of
$M \log_2 M$ butterfly steps. Thus, an ideal speedup of M is achieved for
butterfly operations with a cost of M-1 data transfers.

This approach can be generalized for N < M. For example, if N = M/2,
each PE is given two rows of the input matrix S. The FFTs on the rows
of S are performed by two serial FFTs, executed one after the other,
on the two rows in each PE. This yields G, with each PE having two
rows of G. The second step is to form the transpose of G, $G^T$, where
each PE has two rows of $G^T$ (i.e., each PE has two columns of G). If
PE i contains rows 2i and 2i+1, then, in general, G(i,j) is
transferred from PE $\lfloor i/2 \rfloor$ to PE $\lfloor j/2 \rfloor$,
$0 \le i,j < M$. The complexity associated with the transpose is 2M-4
parallel transfers. The -4 term appears because the diagonal and
near-diagonal terms are already in the correct PE. The final step is to
perform a 1-D DFT on the columns of G. This is done by two serial FFTs
in each PE, as above. This gives $F^T$, with each PE having two rows of
$F^T$. This implementation has a complexity of four serial FFT
algorithms, or $2M \log_2 M$ butterfly steps. This is the maximum
possible reduction in the number of butterfly steps, given M/2 PEs. The
overhead associated with the transpose is 2M-4 transfers.

In general, when this method is implemented on N PEs, $N \le M$, the
complexity will be derived directly from the 1-D FFT algorithm used. If
the complexity of the serial 1-D FFT algorithm is C, then the complexity
of the 2-D FFT algorithm is 2(M/N)C plus the cost of computing the
transpose. If $N = M/2^r$, the cost of the transpose is $2^r(M-2^r)$ data
transfers. The $-2^r$ term appears because before the transpose each PE
holds $2^r$ rows, and after the transpose each PE holds $2^r$ columns.
Thus, only $M-2^r$ elements of each row need to be transferred. In all
cases, the necessary inter-PE data transfers can be done efficiently
by the Cube and ADM multistage networks.
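
The general cost expression can be tabulated for small cases as a consistency check. The sketch below rests on two assumptions: each PE holds 2^r rows (so N = M/2^r, the reading that reproduces the M-1 and 2M-4 transfer counts given for N = M and N = M/2), and the serial 1-D FFT cost is C = (M/2) log2 M butterflies, which follows from the stated total of M^2 log2 M butterflies for 2M FFTs. The function name fft2_cost is hypothetical.

```python
def fft2_cost(M, r):
    """Return (N, butterfly steps, transpose transfers) for N = M / 2**r PEs."""
    rows_per_pe = 1 << r
    N = M // rows_per_pe
    log2_M = M.bit_length() - 1
    C = (M // 2) * log2_M                  # one serial M-point FFT (assumed radix-2)
    butterfly_steps = 2 * (M // N) * C     # 2(M/N)C
    transpose_transfers = rows_per_pe * (M - rows_per_pe)
    return N, butterfly_steps, transpose_transfers


for r in range(3):
    print(r, fft2_cost(8, r))
```

For M = 8 this gives M log2 M = 24 butterfly steps and M-1 = 7 transfers at r = 0, and 2M log2 M = 48 butterfly steps and 2M-4 = 12 transfers at r = 1, matching the two cases worked out in the text.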


Table  1.  The  PASM  design  parameters,  based  on  current  plans. 


                                    general       full PASM    PASM prototype

Number of PEs                       N             1024         16
Number of network stages            log2 N + 1    11           5
  (Extra Stage Cube)
Number of MCs                       Q             32           4
Number of PEs per MC                N/Q           32           4
Number of Memory Storage Units      N/Q           32           4
Number of Memory Management         fixed         5            5
  System processors
Smallest size partition             N/Q           32           4
Maximum number of partitions        Q             32           4


8.  Conclusions 


This  paper  provided  an  overview  of  the  PASM  system  and  examples  of 
its  use.  Table  1  summarizes  the  PASM  design  parameters.  In  order  to 
contrast  PASM  to  a  different  approach  to  parallel  image  processing, 
Table  2  compares  the  features  of  CLIP4  [8]  to  the  planned  features  of 
PASM.  A  reading  list  for  further  information  about  PASM  is  provided 
at  the  end  of  this  paper. 

Table  2.  A  comparison  of  the  features  of  CLIP4  and  the  planned  features  of  PASM. 


feature                  CLIP4                  PASM

Year built               1980                   1983/4 ? (prototype)
Processor type           1-bit, simple          32-bit, complex
                                                (68000 prototype)
Memory size              32 bits                64K words
  per processor
Network type             8 nearest neighbors    multistage
Number of processors     96^2 = 9K              1024 (16 prototype)
  for computation
Image division           pixel/processor        subimage/PE
I/O                      shift by column,       double-buffered PE memories,
                         rows in parallel       multiple secondary storage devices
Modes                    SIMD                   partitionable SIMD/MIMD


In conclusion, the objective of the PASM design is to achieve a
system which attains a compromise between flexibility and
cost-effectiveness for a specific problem domain. A dynamically
reconfigurable system such as PASM should be a valuable tool for both
image processing/pattern recognition and parallel processing research.